Topological network alignment uncovers biological function and phylogeny 
Sequence comparison and alignment has had an enormous impact on our understanding of 
evolution, biology, and disease. Comparison and alignment of biological networks will likely 
have a similar impact. Existing network alignments use information external to the networks, 
such as sequence, because no good algorithm for purely topological alignment has yet been 
devised. In this paper, we present a novel algorithm based solely on network topology, that 
can be used to align any two networks. We apply it to biological networks to produce by far 
the most complete topological alignments of biological networks to date. We demonstrate that 
both species phylogeny and detailed biological function of individual proteins can be extracted 
from our alignments. Topology-based alignments have the potential to provide a completely 
new, independent source of phylogenetic information. Our alignment of the protein-protein 
interaction networks of two very different species, yeast and human, indicates that even distant 
species share a surprising amount of network topology with each other, suggesting broad 
similarities in internal cellular wiring across all life on Earth. 

1 Introduction 
1.1 Background 

A network (or graph) is a collection of nodes (or vertices) and connections between them called edges. 
Graphs are used to describe, model, and analyze an enormous array of phenomena, including physical 
systems like electrical power grids and communication networks, social systems like networks of friendships 
or corporate and political hierarchies, physical relationships such as residue interactions in a folded protein, 
and software systems such as call graphs or expression and syntax trees. 

A graph G(V, E), or G for brevity, has node set V and edge set E. The sheer number and diversity 
of possible graphs (there are $2^{n(n-1)/2}$ possible undirected graphs on n nodes) makes graph classification 
and comparison problems difficult. One particular comparison problem is called subgraph isomorphism, which asks if one 
graph G exists as an exact subgraph of another graph H(U, F). This problem is NP-complete, which means 
that no efficient algorithm is known for solving it. Network alignment is the more general problem of 
finding the best way to "fit" G into H even if G does not exist as an exact subgraph of H. Some networks, 
such as the biological ones that we consider below, may contain noise, i.e., missing edges, false edges, 
or both. In these cases, and also due to biological variation, it is not even obvious how to measure the 
"goodness" of an inexact fit. One measure is the fraction of aligned edges, that is, the 
percentage of edges in E that are aligned to edges in F. We call this the "edge correctness" (EC). However, 
it is possible for two alignments to have similar ECs, one of which exposes large, dense, contiguous, and 
topologically complex regions that are similar in G and H, while the other fails to expose such regions of 
similarity. Additionally, although EC can easily be used to measure the quality of an alignment after the fact, 
it is not clear how to use it to direct an alignment algorithm; in fact, maximizing EC is an NP-hard problem 
since it implies solving the subgraph isomorphism problem. Thus, other strategies must be sought to guide 
the alignment process. 
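
As a minimal sketch of the EC measure defined above, the following Python function counts the edges of G whose aligned endpoints are also joined by an edge in H. The function name, the edge-list representation, and the toy data are our own illustrative assumptions, not part of any published implementation.

```python
def edge_correctness(edges_g, edges_h, alignment):
    """Fraction of G's edges whose aligned endpoints are joined by an edge in H."""
    # Store H's edges as unordered pairs so that (a, b) also matches (b, a).
    h_edges = {frozenset(edge) for edge in edges_h}
    edges_g = list(edges_g)
    aligned = sum(
        1
        for x, y in edges_g
        if frozenset((alignment[x], alignment[y])) in h_edges
    )
    return aligned / len(edges_g)


# Toy data (hypothetical): G is a triangle, H is the path a-b-c.
G_EDGES = [(1, 2), (2, 3), (1, 3)]
H_EDGES = [("a", "b"), ("b", "c")]
ALIGNMENT = {1: "a", 2: "b", 3: "c"}
print(edge_correctness(G_EDGES, H_EDGES, ALIGNMENT))  # two of the three edges are aligned
```

Note that computing EC for a given alignment is cheap (one pass over the edges of G); it is the search for the alignment that maximizes EC that is hard.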

In the biological context, comparing networks of different organisms in a meaningful manner is arguably 
one of the most important problems in evolutionary and systems biology. Analogous to sequence 
alignments between genomes, alignments of biological networks can be useful because we may know a lot 
about some of the nodes in one network and almost nothing about topologically similar nodes in the other 
network; then, specialized knowledge about one may tell us something new about the other. Network align- 
ments can also be used to measure the global similarity between complete networks of different species. 
Given a group of such biological networks, the matrix of pairwise global network similarities can be used to 
infer phylogenetic relationships. 

In this paper, we introduce a novel method for the alignment of two networks that is based solely on the 
mutual similarity of their network topology. As such, this algorithm could be applied to any two networks, 
not just biological ones. For example, our algorithm can be applied to road maps or social networks, which 
obviously have no genetic or protein sequence associated with them. We apply our method to the alignment 
of two protein-protein interaction (PPI) networks and demonstrate that our alignment exposes far more topo- 
logically complex regions of similarity than existing methods can find. We also use our method to compute 
the pairwise all-to-all network similarity matrix between a group of species, and then build a phylogenetic 
tree that bears a striking resemblance to the one based on sequence comparison. The significance of these 
results is that they extract statistically significant meaning from a new source of information, pure network 
topology, that is independent of sequence or any other commonly used biological information. We 
believe that the results in this paper just barely scratch the surface of the information that can be extracted 
from network topology. 

1.2 Our Approach 

Analogous to sequence alignments, there exist local and global network alignments. Thus far, the majority 
of methods used for alignment of biological networks have focused on local alignments. With local 
alignments, mappings are chosen independently for each local region of similarity. However, local alignments 
can be ambiguous, with one node having different pairings in different local alignments. In contrast, a 
global network alignment provides a unique alignment from every node in the smaller network to exactly one 
node in the larger network, even though this may lead to suboptimal matchings in some local regions. Local 
network alignments are generally not able to identify large subgraphs that have been conserved during 
evolution. Global network alignment has been studied previously in the context of biological networks, 
but most existing methods incorporate some a priori information about nodes, such as sequence similarities 
of proteins in PPI networks, or they use some form of learning on a set of "true" alignments. In contrast, 
our alignments are based solely on topological information and do not require learning. This makes 
our method applicable to any type of network, not just biological ones. 

We focus on topology instead of protein sequence because we aim to discover biological knowledge that 
is encoded in the PPI network topology. Since proteins aggregate to perform a function instead of acting in 
isolation, analyzing complex wirings around a protein in a PPI network could give deeper insights into the inner 
workings of cells than analyzing sequences of individual genes. Furthermore, network topology and protein 
sequences might give insights into different slices of biological information, and thus one could lose much 
information by focusing on sequence alone. Although protein sequence similarity correlates with functional 
similarity, there exist proteins with 100% sequence identity that have different functional roles. Thus, 
restricting analysis to sequences might give incorrect functional assignments. Similarly, although high protein 
sequence similarity correlates with similarity in 3-dimensional structure, sequence-similar proteins can have 
structures that differ significantly from one another. Thus, sequence-based homology analyses may mask 
important structural and functional information. On the other hand, since the structure of a protein is 
expected to define the number and type of its potential interacting partners in the PPI network, sequence-similar 
but structurally-dissimilar proteins are expected to have different PPI network topological characteristics. 
Moreover, entirely different sequences can produce identical structures. In cases where such proteins 
are expected to share a common function, sequence-based function prediction would fail, whereas a network 
topology-based one would not. Finally, we show that sequence and topology have similar predictive 
power with respect to Gene Ontology (GO) terms (Supplementary Figure 1), demonstrating that network 
topology can provide as much functional information as protein sequences. Since our goal is to uncover 
biological knowledge encoded in the topology of PPI networks, our alignments do not use protein sequence 
information. Thus, our method can align any type of network, not just biological ones. Note, however, that 
including a sequence component in the cost function of our method is straightforward (see Section 4), but this is 
beyond the scope of this manuscript. 

Obviously, if one is to build meaningful alignments based solely upon network topology, one must first 
have a highly constraining measure of topological similarity. The simplest (and weakest) description of 
the topology of a node is its degree, which is the number of edges that touch it. Our much more highly 
constraining measure is a generalization of the degree of a node. We define a graphlet as a small, connected, 
induced subgraph of a larger network. An induced subgraph on a node set X ⊆ V of G is obtained by 
taking X and all edges of G having both end-nodes in X. Figure 1 shows all the graphlets on 2, 3, 4, and 
5 nodes. For a particular node v in a large network, we define a vector of "graphlet degrees" that counts 
the number of each kind of graphlet that touch v (Figure 2). This vector, or signature, of v describes the 
topology of its neighborhood and captures its interconnectivities out to a distance of 4 (see Section 4.1 and 
Figure 2). This measure is superior to all previous measures, since it is based on all graphlets with up to 5 nodes, 
which is sufficient in practice due to the small-world nature of many real-world networks. 

For our purposes, an alignment of two networks G and H consists of a set of ordered pairs (x, y), where 
x is a node in G and y is a node in H. Our algorithm, called GRAAL (GRAph ALigner), incorporates facets 
of both local and global alignment. We match pairs of nodes originating in different networks based on their 
signature similarity (see Section 4.1), where a higher signature similarity between two nodes corresponds 
to a higher topological similarity between their extended neighborhoods (out to distance 4). The cost of 
aligning two nodes is modified to align the densest parts of the networks first; the cost is reduced as the 
degrees of both nodes increase, since higher degree nodes with similar signatures provide a tighter constraint 
than correspondingly similar low degree nodes (see Section 4.2 and the Supplementary Information); α is 
a parameter in [0,1] that controls the contribution of the node signature similarity to the cost function, the 
other contribution being simply the degree of the node (see Section 4.2). In the case of two node alignments 
comparing equally, the tie is broken randomly. Thus, different runs of the alignment algorithm can produce 
different results, although we generally find that a deterministic "core" alignment remains across all runs. 

We align each node in the smaller network to exactly one node in the larger network. The matching 
proceeds using a technique analogous to the "seed and extend" approach of the popular BLAST algorithm 
for sequence alignment: we first choose a single "seed" pair of nodes (one node from each network) with 
high signature similarity. We then expand the alignment radially outward around the seed as far as practical 
using a greedy algorithm (see Section 4.2). Although local in nature, our algorithm produces large and dense 
global alignments. By "dense" we mean that the aligned subgraphs share many edges, which would not be 
the case in a low-quality or random alignment. We believe that the high quality of our alignments is based 
less on the details of the extension algorithm and more on having a good measure of pairwise topological 
similarity between nodes. 

2 Results and Discussion 

2.1 Pairwise Alignment of Yeast and Human PPI Networks 

Using GRAAL, we align the human PPI network of Radivojac et al. to the Collins et al. yeast PPI 
network, which we call "human1" and "yeast2," respectively. We chose yeast as our second species because 
it currently has a high-quality PPI network, with 16,127 interactions (edges) among 2,390 proteins (nodes). 
The "best" alignment (defined below) found by GRAAL aligns 1,890 of the edges in yeast2 to edges in 
human1. Thus, the edge correctness (EC) of our alignment is 11.72%. There are 970 nodes involved in these 
"correct" edge alignments, representing 40% of all yeast2 nodes. We obtained similar EC for aligning other 
yeast and human networks (Supplementary Figure 2). The best alignment is defined as follows. 
Due to the existence of the α parameter in the cost function (as explained above) and some randomness in 
the GRAAL algorithm (see Section 4.2 and the Supplementary Information for details), the actual alignments 
and ECs vary across different values of α, and across different runs of the algorithm for the same α. 
With this in mind, the best alignment is the alignment with the highest EC over all values of α, and over all 
runs for the given α. The highest EC is obtained for α of 0.8; the minimum EC over all runs for this α is 
higher than the maximum EC over all runs for any other α. Thus, we focus on alignments produced for α of 
0.8. Variation of EC over different runs for this α is small, with minimum and maximum EC of 11.5% and 
11.72%, respectively. Moreover, the intersection of alignments from up to 40 different runs at α of 0.8 contains 
1,433 pairs, i.e., about 60% of the entire alignment. We call this intersection the core alignment. 
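
The selection of the best alignment and the construction of the core alignment described above can be sketched as follows. We assume, purely for illustration, that each run is stored as a dictionary from nodes of the smaller network to nodes of the larger one; the function names and the toy runs are hypothetical.

```python
def best_alignment(runs, score):
    """Return the run with the highest score (for example, edge correctness)."""
    return max(runs, key=score)


def core_alignment(runs):
    """Return the node pairs (x, y) that appear in every run."""
    core = set(runs[0].items())
    for run in runs[1:]:
        core &= set(run.items())
    return core


# Three hypothetical runs aligning nodes 1-3 of G to nodes of H.
RUNS = [
    {1: "a", 2: "b", 3: "c"},
    {1: "a", 2: "c", 3: "b"},
    {1: "a", 2: "b", 3: "d"},
]
CORE = core_alignment(RUNS)
print(CORE)  # only the pair (1, "a") is shared by all three runs
```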

In addition to counting aligned edges, it is important that the aligned edges cluster together to form 
large and dense connected subgraphs, in order to uncover such regions of similar topology. We define a 
common connected subgraph (CCS) as a connected subgraph (not necessarily induced) that appears in both 
networks. The largest CCS in our best alignment (Figure 3A) has 900 interactions amongst 267 proteins, 
which comprises 11.2% of the proteins in the yeast2 network. Our second largest CCS has 286 interactions 
amongst 52 nodes, depicted in Figure 3B. The entire common subgraph is presented in Supplementary 
Figure 3. 
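
One way to extract common connected subgraphs from a given alignment is sketched below: keep only the edges of G that are aligned to edges of H, then split the resulting common subgraph into connected components. This is an illustrative sketch with hypothetical names and toy data, not the code used to produce Figure 3.

```python
from collections import defaultdict, deque


def common_connected_subgraphs(edges_g, edges_h, alignment):
    """Return (node_set, edge_count) for each CCS, ordered by decreasing edge count."""
    h_edges = {frozenset(edge) for edge in edges_h}
    # Adjacency of the common subgraph: edges of G that map onto edges of H.
    adj = defaultdict(set)
    for x, y in edges_g:
        if frozenset((alignment[x], alignment[y])) in h_edges:
            adj[x].add(y)
            adj[y].add(x)
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        seen.add(start)
        nodes, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    nodes.add(neighbour)
                    queue.append(neighbour)
        # Each edge is counted once from each endpoint.
        edge_count = sum(len(adj[n]) for n in nodes) // 2
        components.append((nodes, edge_count))
    components.sort(key=lambda comp: comp[1], reverse=True)
    return components


# Toy data (hypothetical): a shared triangle and a shared single edge.
G_EDGES = [(1, 2), (2, 3), (3, 1), (4, 5), (5, 6)]
H_EDGES = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "e")]
ALIGNMENT = {1: "a", 2: "b", 3: "c", 4: "d", 5: "e", 6: "f"}
CCS = common_connected_subgraphs(G_EDGES, H_EDGES, ALIGNMENT)
```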

2.2 Comparison with Other Methods 

GRAAL uncovers CCSs that are substantially larger and denser than those produced by currently published 
algorithms. The best currently published global alignment of similar networks is the alignment of yeast and 
fly by IsoRank, which uses sequence information in addition to topological information. It aligns 1,420 
edges, but its largest CCS contains just 35 nodes and 35 edges. Our largest CCS aligns 25.7 times as many 
edges and 7.6 times as many nodes in human-yeast as IsoRank's does in fly-yeast. Our second largest CCS 
has a similar number of nodes to IsoRank's largest, but is 8.2 times denser in terms of edges. Furthermore, 
we applied IsoRank to our yeast2-human1 data using only topological information. We found that it aligns 
628 interactions (giving an edge correctness of only 3.89%), with its largest CCS having just 261 
interactions among 116 proteins. Recently, IsoRankN, an algorithm for global alignment of multiple networks, 
has been introduced. However, a comparison with GRAAL is not feasible, since the two algorithms produce 
different kinds of alignment output. While GRAAL's output is a list of one-to-one node mappings between 
the networks being aligned, IsoRankN's alignment contains sets of network-aligned proteins, where no two 
sets overlap, but each set can contain more than one node (i.e., many-to-many node mapping) from each of 
the networks being aligned; thus, IsoRankN's output cannot be quantified topologically with EC. Another 
popular global network alignment method is Graemlin. We do not compare our alignment to one 
produced by Graemlin because Graemlin requires a variety of other input information, including phylogenetic 
relationships between the species being aligned. In contrast, GRAAL's output can be used to infer 
phylogenetic relationships. Finally, other methods potentially better than IsoRank exist; however, their current 
implementations failed to process networks of the size of yeast2 and human1. 

2.3 Statistical Significance of GRAAL's Yeast-Human Alignment 

In the following three paragraphs, we look at three distinct ways in which to judge the statistical significance 
of our alignment: first, we judge the quality of our alignment compared to a random alignment of these two 
particular networks; second, we comment on the amount of similarity found between yeast and human in our 
alignment; and third, we interpret the biological significance of our alignment. Section 4 and the Supplementary 
Information provide more details on all of the above. 

Given a random alignment of yeast2 to human1, the probability of obtaining an edge correctness of 
11.72% or better (p-value) is less than 7 x 10^^. The probability of obtaining a large CCS would be significantly 
smaller, so this represents a weak upper bound on our p-value. 

Judging the amount of similarity found between the yeast2 and human1 networks in our alignment 
requires us to state carefully what we are comparing against. If we use GRAAL to align networks drawn 
from several different random graph models that have the same number of nodes and edges as yeast2 and 
human1, we find that the edge correctness between random networks is significantly lower than the edge 
correctness of our yeast2-human1 alignment. For example, aligning two Erdős-Rényi random graphs with 
the same degree distribution as the data ("ER-DD") gives an edge correctness of only about 0.31 ± 0.22%. 
Similar alignments of Barabási-Albert type scale-free networks ("SF-BA"), stickiness model networks 
("STICKY"), or 3-dimensional geometric random graphs ("GEO-3D") give edge correctness scores of 
only 2.86 ± 0.57%, 5.89 ± 0.39% and 8.8 ± 0.39%, respectively. Accepting GEO-3D as the best available 
null model (see Section 4.3), the p-value of our yeast2-human1 alignment is at most 8.4 x 10~^. This tells 
us that yeast and human, two very different species, enjoy more network similarity than chance would allow. 

We measure the biological significance of our alignment by counting how many of our aligned pairs 
share common Gene Ontology (GO) terms. GO terms succinctly describe the many biological proper- 
ties that a given protein may have. For this analysis, we consider the "complete" GO annotation data set, 
containing all GO annotations, independent of GO evidence code. GO annotation data was downloaded 
in September 2009. Across our entire best yeast2-human1 alignment, 45.1%, 15.6%, 5.1%, and 2.0% of 
aligned protein pairs share at least one, two, three, and four GO terms, respectively. Compared to random 
alignments, the p-values for these percentages are all in the 10~^ to 10~^ range. Furthermore, the results 
improve across GRAAL's core yeast2-human1 alignment: 50.9%, 19.3%, 7.3%, and 3.0% of aligned protein 
pairs share at least one, two, three, and four GO terms, respectively; the p-values for these percentages are 
all in the 10~^ to 10~^ range. Our results are better than those achieved by IsoRank. In the global alignment 
produced by IsoRank, 44.2%, 14.1%, 4.1%, and 1.5% of aligned protein pairs share at least one, two, three, 
and four GO terms, respectively. Similarly, if we restrict our analysis to the largest CCS, 
the percentages in GRAAL's CCS are 67.2%, 22.0%, and 5.2% for sharing at least one, two, and three common GO 
terms, respectively, while in IsoRank's CCS these percentages are only 60.6%, 11.9%, and 0%, respectively. 
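
The percentages above come from counting aligned pairs that share at least k GO terms. A minimal sketch of such a count is given below; we assume that annotations are stored as sets of term identifiers and that pairs with an unannotated protein count as sharing nothing, which may differ from the exact convention used in our analysis. All names and term labels in the example are hypothetical placeholders.

```python
def fraction_sharing_terms(alignment, go_g, go_h, k):
    """Fraction of aligned pairs whose GO term sets share at least k terms."""
    pairs = list(alignment.items())
    hits = sum(
        1
        for x, y in pairs
        # Proteins without annotations are treated as having an empty term set.
        if len(go_g.get(x, set()) & go_h.get(y, set())) >= k
    )
    return hits / len(pairs)


# Placeholder annotations; "t1", "t2", ... stand in for GO term identifiers.
ALIGNMENT = {"y1": "h1", "y2": "h2", "y3": "h3", "y4": "h4"}
GO_YEAST = {"y1": {"t1", "t2"}, "y2": {"t1"}, "y3": {"t3"}}
GO_HUMAN = {"h1": {"t1", "t2", "t4"}, "h2": {"t1"}, "h3": {"t5"}, "h4": {"t1"}}
for k in (1, 2):
    print(k, fraction_sharing_terms(ALIGNMENT, GO_YEAST, GO_HUMAN, k))
```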

2.4 Application to Protein Function Prediction 

With the above validations in hand, we believe that GRAAL's alignments can be used to predict biological 
characteristics (i.e., GO molecular function (MF), biological process (BP), and cellular component (CC)) of 
un-annotated proteins based on their alignments with annotated ones. 

Here, we distinguish between two different sets of GO annotation data: the complete set described 
above, containing all GO annotations, independent of GO evidence codes, and the "bio-based" set, containing 
GO annotations obtained by experimental evidence codes only. Since in the complete 
GO annotation data set many GO terms were assigned to proteins computationally (e.g., from sequence 
alignments), this set is biologically less reliable than the bio-based one. We make predictions with respect 
to both GO annotation data sets, as described below. 

First, we analyze GRAAL's best yeast2-human1 alignment (i.e., the alignment with the highest EC over 
all runs for alpha of 0.8, as explained in Section 2.1) to identify protein pairs where one of the proteins 
is annotated with a "root" GO term: GO:0003674 for MF, GO:0008150 for BP, or GO:0005575 for CC, 
signifying that a protein is expected to have a MF, BP, or CC, respectively, but that no information was 
available as of the date of annotation. Next, we check if aligned partners of such proteins are annotated 
with a known MF, BP, or CC GO term, correspondingly, with respect to both the complete and bio-based GO 
annotation data sets. If so, we assign all known MF, BP, or CC GO terms to the protein currently annotated 
by the corresponding "root" GO term. 
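
The annotation-transfer rule just described can be sketched as follows. The three root GO identifiers are those given above; the data layout (protein to category to set of terms), the function name, and the placeholder term labels are illustrative assumptions only.

```python
# Root GO terms named in the text: a protein annotated only with a root term
# has no known function, process, or component in that category.
ROOT_TERMS = {"MF": "GO:0003674", "BP": "GO:0008150", "CC": "GO:0005575"}


def transfer_annotations(alignment, ann_g, ann_h):
    """Return (protein, category, predicted_terms) for root-only proteins."""
    predictions = []
    for x, y in alignment.items():
        for category, root in ROOT_TERMS.items():
            terms_x = ann_g.get(x, {}).get(category, set())
            terms_y = ann_h.get(y, {}).get(category, set())
            known_x, known_y = terms_x - {root}, terms_y - {root}
            if root in terms_x and not known_x and known_y:
                # x has only the root term; borrow the partner's known terms.
                predictions.append((x, category, known_y))
            elif root in terms_y and not known_y and known_x:
                predictions.append((y, category, known_x))
    return predictions


# Placeholder annotations; "mf_term_1" etc. stand in for real GO identifiers.
ANN_YEAST = {"y1": {"MF": {"GO:0003674"}, "BP": {"bp_term_1"}}}
ANN_HUMAN = {"h1": {"MF": {"mf_term_1", "mf_term_2"}, "BP": {"GO:0008150"}}}
PREDICTIONS = transfer_annotations({"y1": "h1"}, ANN_YEAST, ANN_HUMAN)
```

In this toy case the yeast protein receives the human partner's MF terms and the human protein receives the yeast partner's BP term; no CC prediction is made because neither protein carries the CC root term.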

With respect to the complete GO data set, we predict MF for 44 human and 435 yeast proteins, BP 
for 53 human and 157 yeast proteins, and CC for 52 human and 54 yeast proteins. Since the GO database 
offers a list with an explicit note that a protein is not associated with a given GO term, we were able to 
examine directly whether our predictions contradicted this list. We found no contradictions in the GO database 
for any of the yeast or human proteins with respect to MF or BP; we found a contradiction for only one of 
our human predictions with respect to CC. We also attempted to validate all of our predictions using the 
literature search and text mining tool CiteXplorer. For 34.1%, 43.4%, and 46.2% of our MF, BP, and CC 
human predictions, respectively, this tool found at least one article mentioning the protein of interest in the 
context of at least one of our predictions for that protein. For yeast, these percentages are 42.07%, 3.18%, 
and 12.96%, respectively. Our human and yeast predictions made with respect to the complete GO data set 
are presented in Supplementary Tables 1 and 2, respectively. 

With respect to the bio-based GO data set, we predict MF for 30 human and 214 yeast proteins, BP for 
42 human and 41 yeast proteins, and CC for 45 human and 17 yeast proteins. None of these predictions 
were contradicted in the GO database. We validated with CiteXplorer 10%, 4.76%, and 20% of our bio-based 
MF, BP, and CC human predictions, respectively. We also validated 48.1% of our bio-based MF yeast 
predictions. Our human and yeast predictions made with respect to the bio-based GO data set are presented in 
Supplementary Tables 3 and 4, respectively. 

2.5 Reconstruction of Phylogenetic Trees by Aligning Metabolic Pathways Across Species 

Finally, we describe a completely different application: how purely topological alignment of metabolic 
networks obtained by GRAAL can be used to recover phylogenetic relationships. 

Several studies analyzing metabolic pathways in different species have aimed to find an evolutionary 
relationship between those species and construct their phylogenetic trees. Different distance metrics 
have been used for constructing phylogenetic trees. For example, similarities between pathways have been 
computed from sequence similarities between corresponding substrates and enzymes from individual 
pathways or as a combination of similarities of enzymes from individual metabolic networks and topologies 
of these networks. The similarity of enzymes is based on the similarity of their sequences, structures, or 
Enzyme Commission numbers. The topological similarity of two pathways has been based on the similarity 
between nodes (corresponding to enzymes) and the similarity of their neighborhoods, measuring whether 
a node influences similar nodes and whether it is influenced by similar nodes itself. In addition, topological 
similarity of metabolic pathways combining global network properties, such as the diameter and clustering 
coefficient, and similarities of shared node (i.e., enzyme) neighborhoods has been used. 

Therefore, although related attempts exist, they all still use some biological or functional information 
such as sequence similarities to define node similarities and derive phylogenetic trees from pathways. Since 
we use only network topology to define protein similarity, our information source is fundamentally differ- 
ent. Thus, our algorithm recovers phylogenetic relationships (but not the evolutionary timescale of species 
divergence) in a completely novel and independent way from all existing methods for phylogenetic recovery. 

It has been shown that PPI network structure has subtle effects on the evolution of proteins and that 
reasonable phylogenetic inference can only be done between closely related species. In the KEGG pathway 
database, there are 17 eukaryotic organisms with fully sequenced genomes, of which seven are protists, six 
are fungi, two are plants, and two are animals. Here we focus on protists (see the Supplementary Information 
for fungi). For each organism, we extract the union of all metabolic pathways from KEGG, and then find 
all-to-all pairwise network alignments between species using GRAAL. The edge correctness scores between 
pairs of protist networks range from 29.6% to 76.7%. We create phylogenetic trees using the average distance 
algorithm, with pairwise edge correctness as the distance measure. We compare our phylogenetic trees to 
the published ones obtained from genetic or amino acid sequence alignments. Figure 4 presents our 
phylogenetic tree for protists and shows that it is very similar to that found by sequence comparison. We 
can estimate the statistical significance of our tree by measuring how it compares to trees built from random 
networks of the same size as the metabolic networks (see the Supplementary Information); we find that 
the p-value of our tree is less than 1.3 x 10~^. Phylogenetic trees based on alignments made by IsoRank 
do not differ significantly from random ones (see the Supplementary Information). We also find that the 
topologies of the entire metabolic networks of Cryptosporidium parvum and Cryptosporidium hominis are 
very similar, having edge correctness of 75.72%. This result is encouraging since these organisms are two 
morphologically identical species of Apicomplexan protozoa with 97% genetic sequence identity, but with 
strikingly different hosts that contribute to their divergence. 
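
To illustrate how a tree topology can be obtained from a matrix of pairwise edge correctness scores, the sketch below implements plain average-linkage agglomerative clustering. It assumes that the distance between two species is taken as 1 − EC, which is our own illustrative choice for turning a similarity into a distance; the species labels and EC values are hypothetical, and the function is not the tree-building software used for Figure 4.

```python
from itertools import combinations


def average_linkage_tree(names, distance):
    """Build a tree (nested tuples) by repeatedly merging the closest clusters.

    names: list of leaf labels; distance: callable (a, b) -> float on leaves.
    """
    # Each cluster is (subtree, list_of_leaf_labels).
    clusters = [(name, [name]) for name in names]

    def average(pair):
        (_, leaves_a), (_, leaves_b) = pair
        total = sum(distance(a, b) for a in leaves_a for b in leaves_b)
        return total / (len(leaves_a) * len(leaves_b))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=average)
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0]


# Hypothetical pairwise EC scores for four species labelled A-D.
EC = {frozenset(pair): 0.3 for pair in combinations("ABCD", 2)}
EC[frozenset("AB")] = 0.8
EC[frozenset("CD")] = 0.7
TREE = average_linkage_tree(list("ABCD"), lambda a, b: 1 - EC[frozenset((a, b))])
print(TREE)
```

Here the two pairs with the highest EC are merged first, and the two resulting clusters are joined at the root.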

Note that all of the metabolic networks that we align are derived from a mix of both experimental 
data and genetic sequence-based data. Thus, the fact that we recover almost the same tree as sequence-based 
methods is a strong validation of our method. Once the KEGG database gets updated to have metabolic path- 
ways that are determined solely by experiment, our phylogenetic trees will provide a new and completely 
independent, objective source of phylogenetic information, as well as a novel, independent verification of 
the sequence-based phylogeny. Given that our phylogenetic tree is slightly different from that produced by 
sequence, there is no reason to believe that the sequence-based one should a priori be considered the correct 
one. Sequence-based phylogenetic trees are built based on multiple alignment of gene sequences and whole 
genome alignments. Multiple alignments can be misleading due to gene rearrangements, inversions, 
transpositions, and translocations that occur at the substring level. Furthermore, different species might have an 
unequal number of genes or genomes of vastly different lengths. Whole genome phylogenetic analyses can 
also be misleading due to non-contiguous copies of a gene or non-decisive gene order. Finally, the trees 
are built incrementally from smaller pieces that are "patched" together probabilistically, so probabilistic 
errors in the tree are expected. Our tree suffers from none of the above problems, although it may suffer 
from others that are presumably independent of those above. 

3 Conclusions 

Network alignment has applications across an enormous span of domains, from social networks to software 
call graphs. In the biological domain, the mass of currently available network data will only continue to 
increase and we believe that high-quality topological alignments can yield new and pivotal insights into 
function, evolution, and disease. 



4 Methods 



A graph G(V, E), or G for brevity, has node set V and edge set E. Given $n = |V|$ nodes, the maximum number 
of undirected edges is $M = n(n-1)/2$, and the number of possible undirected graphs on $n$ nodes is thus 
$2^M$. The sheer number and diversity of possible graphs makes graph classification and comparison problems 
difficult. One of those problems is called subgraph isomorphism: given two arbitrary graphs G(V, E) and 
H(U, F) such that $|V| < |U|$, does G exist as a subgraph of H? That is, is there a discrete map $\sigma : V \to U$ 
defined $\forall v \in V$ such that $(x, y) \in E \Rightarrow (\sigma x, \sigma y) \in F$? This problem is NP-complete, which means that 
no efficient algorithm is known for finding the mapping $\sigma$; the only known generally applicable way is to 
search through all possible mappings from V to U. Since the number of such mappings is exponential in 
both $|V|$ and $|U|$, this is considered an intractable problem. 
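
The exhaustive search mentioned above can be sketched in a few lines. The sketch assumes that the map must be injective (distinct nodes of G go to distinct nodes of H) and simply tries every such map, so it is practical only for very small graphs; the function name and toy graphs are illustrative.

```python
from itertools import permutations


def find_subgraph_map(nodes_g, edges_g, nodes_h, edges_h):
    """Return an injective map V -> U sending every edge of G to an edge of H, or None."""
    h_edges = {frozenset(edge) for edge in edges_h}
    nodes_g = list(nodes_g)
    # Try every ordered choice of |V| distinct nodes of H as images of V.
    for image in permutations(nodes_h, len(nodes_g)):
        mapping = dict(zip(nodes_g, image))
        if all(frozenset((mapping[x], mapping[y])) in h_edges for x, y in edges_g):
            return mapping
    return None


# Toy data (hypothetical): H is a 4-cycle; a 3-node path fits, a triangle does not.
SQUARE = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
PATH_MAP = find_subgraph_map([1, 2, 3], [(1, 2), (2, 3)], "abcd", SQUARE)
TRIANGLE_MAP = find_subgraph_map([1, 2, 3], [(1, 2), (2, 3), (1, 3)], "abcd", SQUARE)
```

The number of candidate maps examined is |U|!/(|U| − |V|)!, which is why this approach does not scale to networks with thousands of nodes.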



4.1 Graphlet Degree Signatures and Signature Similarities 

GRAAL aligns a pair of nodes originating in different networks based on a similarity measure of their local 
neighborhoods. This measure generalizes the degree of a node, which counts the number of edges that the 
node touches, into the vector of graphlet degrees, counting the number of graphlets that the node touches, 
for all 2- to 5-node graphlets (see Figure 1). Note that the degree of a node is the first coordinate in this vector, 
since an edge (graphlet $G_0$ in Figure 1) is the only 2-node graphlet. Since it is topologically relevant to 
distinguish between, for example, nodes touching graphlet $G_1$ at an end or at the middle, the notion of 
automorphism orbits (or just orbits, for brevity) is used. By taking into account the "symmetries" between 
nodes of a graphlet, there are 73 different orbits across all 2- to 5-node graphlets. We number the orbits from 
0 to 72. The full vector of 73 coordinates is the signature of a node (Figure 2). 
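
To make the notion of orbits concrete, the sketch below counts only the first four coordinates of a signature, i.e., those arising from the 2- and 3-node graphlets. We assume the usual numbering in which orbit 0 is an edge, orbits 1 and 2 are the end and the middle of a 3-node path, and orbit 3 is a triangle; the full 73-coordinate signature requires enumerating 4- and 5-node graphlets as well, which this illustrative function does not attempt.

```python
from itertools import combinations


def small_orbit_counts(adj, v):
    """Counts of orbits 0-3 for node v; adj maps each node to its set of neighbours."""
    nbrs = adj[v]
    # Orbit 0: edges touching v, i.e. the degree of v.
    orbit0 = len(nbrs)
    # Orbit 3: triangles through v, i.e. pairs of neighbours that are adjacent.
    orbit3 = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    # Orbit 2: v is the middle of an induced 3-node path (non-adjacent neighbour pairs).
    orbit2 = orbit0 * (orbit0 - 1) // 2 - orbit3
    # Orbit 1: v is an end of an induced path v-u-w, so w must not be adjacent to v.
    orbit1 = sum(1 for u in nbrs for w in adj[u] if w != v and w not in nbrs)
    return [orbit0, orbit1, orbit2, orbit3]


# Toy graph (hypothetical): triangle 1-2-3 with a pendant node 4 attached to node 3.
ADJ = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
for node in ADJ:
    print(node, small_orbit_counts(ADJ, node))
```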

The signature of a node provides a novel and highly constraining measure of local topology in its vicinity,
and comparing the signatures of two nodes provides a highly constraining measure of local topological
similarity between them. The signature similarity is computed as follows. For a node $u \in G$, $u_i$ denotes the
$i$-th coordinate of its signature vector, i.e., $u_i$ is the number of times node $u$ is touched by orbit $i$ in $G$. The
distance $D_i(u, v)$ between the $i$-th orbits of nodes $u$ and $v$ is defined as

$$D_i(u, v) = w_i \times \frac{|\log(u_i + 1) - \log(v_i + 1)|}{\log(\max\{u_i, v_i\} + 2)},$$

where $w_i$ is a weight of orbit $i$ that accounts for dependencies between orbits; for example, differences in
counts of orbit 3 will imply differences in counts of all orbits that contain a triangle, such as orbits 10-14,
25, 26, etc., and thus a higher weight is assigned to orbit 3, $w_3$, than to the orbits that contain it. The total
distance $D(u, v)$ between nodes $u$ and $v$ is defined as

$$D(u, v) = \frac{\sum_{i=0}^{72} D_i(u, v)}{\sum_{i=0}^{72} w_i}.$$

The distance $D(u, v)$ is in $[0, 1)$, where distance 0 means that the signatures of nodes $u$ and $v$ are identical.
Finally, the signature similarity, $S(u, v)$, between nodes $u$ and $v$ is $S(u, v) = 1 - D(u, v)$.
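
The computation of $S(u, v)$ from two signature vectors can be sketched as follows. This is an illustrative sketch only: it assumes the per-orbit distance is normalized as $D_i(u, v) = w_i \, |\log(u_i + 1) - \log(v_i + 1)| / \log(\max\{u_i, v_i\} + 2)$ and that $D(u, v)$ is the sum of the $D_i$ divided by the sum of the weights. The orbit weights $w_i$ are not specified in this section, so uniform weights are used as a placeholder, and the function names are hypothetical.

```python
import math


def signature_distance(sig_u, sig_v, weights=None):
    """Weighted, normalized distance between two graphlet degree signatures."""
    if weights is None:
        # Assumption: uniform weights stand in for the orbit weights w_i.
        weights = [1.0] * len(sig_u)
    if not (len(sig_u) == len(sig_v) == len(weights)):
        raise ValueError("signatures and weights must have the same length")
    total = 0.0
    for u_i, v_i, w_i in zip(sig_u, sig_v, weights):
        numerator = abs(math.log(u_i + 1) - math.log(v_i + 1))
        denominator = math.log(max(u_i, v_i) + 2)  # always >= log(2) > 0
        total += w_i * numerator / denominator
    return total / sum(weights)


def signature_similarity(sig_u, sig_v, weights=None):
    """S(u, v) = 1 - D(u, v)."""
    return 1.0 - signature_distance(sig_u, sig_v, weights)


# Toy 4-coordinate signatures; real signatures have 73 coordinates (orbits 0 to 72).
print(signature_similarity([3, 1, 0, 2], [3, 1, 0, 2]))  # identical signatures
print(signature_similarity([3, 1, 0, 2], [5, 0, 0, 2]))
```

The logarithms damp the effect of very large orbit counts, so a difference of a few graphlets matters more for low-count nodes than for hubs.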
4.2 GRAAL (GRAph ALigner) Algorithm



When aligning two graphs $G(V, E)$ and $H(U, F)$, GRAAL first computes the costs of aligning each node in
$G$ with each node in $H$. The cost of aligning two nodes takes into account the signature similarity between
them, modified to reduce the cost as the degrees of both nodes increase, since higher-degree nodes with
similar signatures provide a tighter constraint than correspondingly similar low-degree nodes (see the
Supplementary Information). The parameter $\alpha \in [0, 1]$ controls the contribution of the signature similarity
to the cost function; that is, $1 - \alpha$ controls the contribution of node degrees to the cost
function. In this way, we align the densest parts of the networks first.
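
One cost function with the properties described above can be sketched as follows. The exact form is given only in the Supplementary Information, so the expression used here is an assumption chosen to match the description: it mixes a normalized degree term (weight $1 - \alpha$) with the signature similarity (weight $\alpha$) and subtracts the result from a constant so that the cost falls as either term grows. The function name and parameter names are hypothetical.

```python
def alignment_cost(similarity, deg_v, deg_u, max_deg_g, max_deg_h, alpha):
    """Illustrative node-pair cost; lower values indicate better candidate pairs.

    Assumed form (not confirmed by this section):
        cost = 2 - ((1 - alpha) * (deg_v + deg_u) / (max_deg_g + max_deg_h)
                    + alpha * similarity)
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    degree_term = (deg_v + deg_u) / (max_deg_g + max_deg_h)
    return 2.0 - ((1.0 - alpha) * degree_term + alpha * similarity)


# Same signature similarity, different degrees: the high-degree pair costs less.
print(alignment_cost(0.9, 2, 2, 10, 10, alpha=0.5))
print(alignment_cost(0.9, 9, 9, 10, 10, alpha=0.5))
```

With $\alpha = 1$ the cost depends on signature similarity alone; with $\alpha = 0$ it depends on node degrees alone.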

It is also possible to add a protein sequence component to the cost function, to balance between topological
and sequence similarity of aligned nodes. This can be done trivially by adding another parameter $\beta$ to the
cost function that would control the contribution of the current topologically derived costs, while $1 - \beta$
would control the contribution of node sequence similarities to the total cost function; a similar approach has
been taken by other relevant studies. However, as we aim to extract only biological information encoded
in network topology, analyzing how balancing between topological and sequence similarity affects the
resulting alignments is outside the scope of our manuscript and is a subject of future work.

GRAAL chooses as the initial seed a pair of nodes $(v, u)$, $v \in V$ and $u \in U$, that have the smallest cost.
Ties are broken randomly, which leads to slightly different results across different runs. Once the seed is
found, GRAAL builds "spheres" of all possible radii around nodes $v$ and $u$. A sphere of radius $r$ around
node $v$ is the set of nodes $S_G(v, r) = \{x \in V : d(v, x) = r\}$ that are at distance $r$ from $v$, where the distance
$d(v, x)$ is the length of the shortest path from $v$ to $x$. Spheres of the same radius in the two networks are then
greedily aligned together by searching for the pairs $(v', u')$, with $v' \in S_G(v, r)$ and $u' \in S_H(u, r)$, that are
not already aligned and that can be aligned with the minimal cost. When all spheres around the seed $(v, u)$
have been aligned, some nodes in both networks may remain unaligned. For this reason, GRAAL repeats
the same algorithm on a pair of networks $(G^p, H^p)$ for $p = 1, 2,$ and $3$, and searches for a new seed again,
if necessary. We define $G^p = (V, E^p)$ as a new network with the same set of nodes as $G$
and with $(v, x) \in E^p$ if and only if the distance between nodes $v$ and $x$ in $G$ is less than or equal to $p$, i.e.,
$d_G(v, x) \le p$. Note that $G^1 = G$. Using $G^p$ with $p > 1$ allows us to align a path of length $p$ in one network to
a single edge in another network, which is analogous to allowing "insertions" or "deletions" in a sequence
alignment. GRAAL stops when each node from $G$ is aligned to exactly one node in $H$.
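
The building blocks of this procedure (spheres around a seed, the network power $G^p$, and the greedy pairing of two spheres) can be sketched as follows. This is a minimal illustration, not the authors' implementation, whose pseudocode is in the Supplementary Information. The adjacency-dictionary representation and all function names are assumptions, and ties are resolved by sort order here rather than randomly.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path distance (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def spheres(adj, source):
    """Map each radius r >= 1 to the sphere {x : d(source, x) = r}."""
    result = {}
    for node, d in bfs_distances(adj, source).items():
        if d > 0:
            result.setdefault(d, set()).add(node)
    return result


def graph_power(adj, p):
    """Adjacency of G^p: v and x are joined iff 1 <= d_G(v, x) <= p."""
    return {
        v: {x for x, d in bfs_distances(adj, v).items() if 1 <= d <= p}
        for v in adj
    }


def greedy_align(sphere_g, sphere_h, cost, aligned_g, aligned_h):
    """Pair still-unaligned nodes of two spheres in order of increasing cost."""
    candidates = sorted(
        ((cost(v, u), v, u)
         for v in sphere_g if v not in aligned_g
         for u in sphere_h if u not in aligned_h),
        key=lambda item: item[0],
    )
    pairs = []
    for _, v, u in candidates:
        if v not in aligned_g and u not in aligned_h:
            aligned_g.add(v)
            aligned_h.add(u)
            pairs.append((v, u))
    return pairs


# Hypothetical toy network: the path 0 - 1 - 2 - 3.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(spheres(path, 0))
print(graph_power(path, 2))
```

In `graph_power(path, 2)`, node 0 becomes adjacent to node 2, so a two-edge path in one network can be matched to a single edge in the other.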

GRAAL produces global alignments. We note that optimal global alignments are not necessarily unique.
Given any particular cost function, there may be many distinct alignments that all share the optimal cost.
In this paper, we analyze just one specific alignment that we believe is a good one, although it may not
be optimal even according to our measure. Enumerating all optimal (or at least good) alignments requires
extending our algorithm to allow many-to-many mappings between the nodes in the two networks, and is a
subject of future work. Thus, many more predictions of equal validity to those in this paper are likely to
be possible. However, we empirically demonstrate that a large portion (about 60%) of the entire alignment is
conserved across different runs of the algorithm; thus, this core alignment is independent of the randomness
in the algorithm.

The algorithm's pseudocode and details of the complexity analysis are presented in the Supplementary
Information. The software and data used in this paper are available upon request.

4.3 Statistical Significance of our Yeast-Human Alignment

Given a GRAAL alignment of two networks $G(V, E)$ and $H(U, F)$, we compute the probability of obtaining
a given or better edge correctness score at random. For this purpose, an appropriate null model of random
alignment is required. A random alignment is a random mapping $f$ between nodes in two networks $G(V, E)$
and $H(U, F)$, $f : V \to U$. GRAAL produces global alignments, so that all nodes in the smaller network
(smaller in terms of the number of nodes) are aligned with nodes in the larger network. In other words,
$f$ is defined $\forall v \in V$. This is equivalent to aligning each edge from $G(V, E)$ with a pair of nodes (not
necessarily an edge) in $H(U, F)$. Thus, we define our null model of random alignment as a random mapping
$g : E \to U \times U$. We define $n_1 = |V|$, $n_2 = |U|$, $m_1 = |E|$, and $m_2 = |F|$. We also define the number
of node pairs in $H$ as $p = \frac{n_2(n_2 - 1)}{2}$, and let $EC = x\%$ be the edge correctness of the given alignment.
We let $k = [m_1 \times EC] = [m_1 \times x]$ be the number of edges from $G$ that are aligned to edges in $H$.
Then, the probability $P$ of successfully aligning $k$ or more edges by chance is the tail of the hypergeometric
distribution:

$$P = \sum_{i=k}^{\min(m_1, m_2)} \frac{\binom{m_2}{i} \binom{p - m_2}{m_1 - i}}{\binom{p}{m_1}}.$$

For our yeast2-human1 alignment, we find $P \approx 7 \times 10^{\ldots}$.

Now we describe how to estimate the statistical significance of the amount of similarity we find between
yeast2 and human1 in our alignment. To do that, we need to estimate how much similarity one would
expect to find between two random networks, and doing that, in turn, requires us to specify how we generate
model random networks. Given two models that purport to fit a set of observations, we generally consider
as superior the one that has fewer tunable parameters. For example, the STICKY and ER-DD models are
constructed to preserve the degree distribution of the data. These and other data-driven models of random
networks are thus expected to model particular PPI networks better than theoretical network models.
However, they are not an appropriate choice to judge whether the yeast2 and human1 networks share a
significant amount of structural similarity; this is because these models are strongly conditioned on these
particular networks and thus they might transfer onto the model networks the similarities between yeast2
and human1 that we aim to detect in the first place. Thus, we search for a well-fitting theoretical null
model. Arguably the best currently known theoretical model for PPI networks, requiring the fewest tunable
parameters, is the geometric random graph model ("GEO"), in which proteins are modeled as existing
in a metric space and are connected by an edge if they are within a fixed, specified distance of each other.

Although early, incomplete PPI datasets were modeled well by scale-free networks because of their
power-law degree distributions, it has been argued that such degree distributions were an artifact of
noise. In the light of new PPI network data, several studies have presented compelling evidence
that the structure of PPI networks is closer to geometric than to scale-free networks. This was done by
comparing frequencies of graphlets in real-world and model networks and by measuring a highly constraining
agreement between "graphlet degree distributions." Finally, it has been shown that PPI networks can be
successfully embedded into a low-dimensional Euclidean space, thus directly confirming that they have a
geometric structure. The superior fit of the GEO model to PPI networks over other models may not be
surprising, since it can be biologically motivated. In particular, the currently accepted paradigm for
evolution is based on a series of gene duplication and mutation events. We outline our crude geometric gene
duplication model. We model genes, and proteins as their products, as existing in some biochemical metric
space. Although the dimension and axes of this space are not obvious, we assume that when a parent gene is
duplicated, the child gene starts at a similar location in the metric space, since it is structurally identical to
the parent and thus inherits interactions from the parent. As mutations and "evolutionary optimization" act
on the child, it drifts away from the parent in the metric space. The child may preserve some of the parent's
interacting partners, but it may also establish new interactions with other genes. Similarly, in a geometric
graph, the closer two nodes are to each other, the more interactors they will have in common, and vice versa.
In addition to PPI networks, GEO is a well-fitting theoretical null model for other biological networks, e.g.,
brain function networks and protein structure networks.
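
The geometric random graph model described in this section can be sketched as follows. The text specifies only that nodes live in a metric space and are joined when they lie within a fixed distance of each other; the unit cube, the uniform placement of points, the Euclidean metric, the default dimension of 3 and the function name are all assumptions made here for illustration.

```python
import math
import random


def geometric_random_graph(n, radius, dim=3, seed=None):
    """Place n points uniformly in the unit cube [0, 1]^dim and join close pairs.

    Returns the point coordinates and the set of edges (i, j) with i < j.
    """
    rng = random.Random(seed)
    points = [tuple(rng.random() for _ in range(dim)) for _ in range(n)]
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            # Connect two nodes when their Euclidean distance is within the radius.
            if math.dist(points[i], points[j]) <= radius:
                edges.add((i, j))
    return points, edges


points, edges = geometric_random_graph(n=50, radius=0.3, seed=1)
print(len(points), len(edges))
```

To produce a model network with the same number of nodes and edges as a given data network, the radius would have to be tuned; that step is not shown here.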
Accepting GEO as the optimal null model for PPI networks, we compute the probability of obtaining
the EC of 11.72% in our alignment of yeast2 and human1 to be $8.4 \times 10^{\ldots}$. We do so by aligning with
GRAAL pairs of GEO networks of the same size as yeast2 and human1 and by applying the following form
of the Vysochanskij-Petunin inequality: $P(|X - \mu| \ge \lambda\sigma) \le \frac{4}{9\lambda^2}$. Since the GEO networks that are aligned
have the same number of nodes and edges as the data, it is reasonable to assume that the distribution of their
alignment scores is unimodal. Thus, we use the Vysochanskij-Petunin inequality, since it is more precise
than Chebyshev's inequality for unimodal distributions. More details are supplied in the Supplementary
Information.
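
The two probability estimates used in this section can be sketched as follows. This is an illustration under stated assumptions, not the authors' code. The first function evaluates the hypergeometric tail for the random-alignment null model, assuming that the $m_1$ edges of $G$ are mapped to distinct node pairs among the $p$ pairs of $H$, of which $m_2$ are edges. The second applies the Vysochanskij-Petunin bound $4 / (9\lambda^2)$, which holds for unimodal distributions when $\lambda > \sqrt{8/3}$, to an observed score and a sample of null-model scores. The function names and the toy numbers are hypothetical.

```python
from fractions import Fraction
from math import comb, sqrt
from statistics import mean, stdev


def hypergeom_tail(p, m1, m2, k):
    """Exact probability that k or more of m1 mapped edges land on the m2 edges of H."""
    upper = min(m1, m2)
    favourable = sum(comb(m2, i) * comb(p - m2, m1 - i) for i in range(k, upper + 1))
    return Fraction(favourable, comb(p, m1))


def vysochanskij_petunin_bound(observed, null_scores):
    """Upper bound on P(|X - mu| >= |observed - mu|) for a unimodal null distribution."""
    mu = mean(null_scores)
    sigma = stdev(null_scores)  # sample standard deviation; must be non-zero
    lam = abs(observed - mu) / sigma
    if lam <= sqrt(8 / 3):
        return 1.0  # the inequality does not apply; fall back to the trivial bound
    return 4.0 / (9.0 * lam ** 2)


# Toy example: H has 4 nodes (p = 6 node pairs) and 3 edges; G has 2 edges.
print(hypergeom_tail(p=6, m1=2, m2=3, k=2))
# Toy null scores standing in for edge correctness values of aligned model networks.
print(vysochanskij_petunin_bound(13, [1, 2, 3, 4, 5]))
```

Exact integer arithmetic avoids underflow for very small tail probabilities, at the price of large intermediate binomial coefficients on real network sizes.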

Acknowledgments 

We thank M. Rasajski for computational assistance. This project was supported by the NSF CAREER
IIS-0644424 grant.

References 

1. Colizza, V., Flammini, A., Serrano, M.A., Vespignani, A.: Detecting rich-club ordering in complex networks. Nature Physics
2 (2006) 110-115

2. Guimera, R., Sales-Pardo, M., Amaral, L.A.N.: Classes of complex networks defined by role-to-role connectivity profiles.
Nature Physics 3 (2007) 63-69

3. Cook, S.: The complexity of theorem-proving procedures. In: Proc. 3rd Ann. ACM Symp. on Theory of Computing: 1971;
New York, Association for Computing Machinery (1971) 151-158

4. Sharan, R., Ideker, T.: Modeling cellular machinery through biological network comparison. Nature Biotechnology 24(4) (Apr
2006) 427-433

5. Venkatesan, K. et al.: An empirical framework for binary interactome mapping. Nature Methods 6(1) (2009) 83-90. 

6. Kelley, B.P., Bingbing, Y., Lewitter, F., Sharan, R., Stockwell, B.R., Ideker, T.: PathBLAST: a tool for alignment of protein
interaction networks. Nucl. Acids Res. 32(Web Server issue) (2004) W83-W88

7. Berg, J., Lassig, M.: Local graph alignment and motif search in biological networks. PNAS 101 (2004) 14689-14694 

8. Flannick, J., Novak, A., Balaji, S., Harley, H., Batzoglou, S.: Graemlin: general and robust alignment of multiple large interaction
networks. Genome Res 16(9) (2006) 1169-1181

9. Liang, Z., Xu, M., Teng, M., Niu, L.: NetAlign: a web-based tool for comparison of protein interaction networks.
Bioinformatics 22(17) (2006) 2175-2177

10. Berg, J., Lassig, M.: Cross-species analysis of biological networks by Bayesian alignment. Proceedings of the National 
Academy of Sciences 103(29) (2006) 10967-10972 

11. Singh, R., Xu, J., Berger, B.: Pairwise global alignment of protein interaction networks by matching neighborhood topology. 
In: Research in Computational Molecular Biology. Springer (2007) 16-31 

12. Flannick, J., Novak, A.F., Do, C.B., Srinivasan, B.S., Batzoglou, S.: Automatic parameter learning for multiple network
alignment. In: RECOMB. (2008) 214-231

13. Zaslavskiy, M., Bach, F., Vert, J.P.: Global alignment of protein-protein interaction networks by graph matching methods. 
Bioinformatics 25(12) (2009) 1259-1267 

14. Komili, S., Farny, N.G., Roth, F.P., Silver, P.A.: Functional specificity among ribosomal proteins regulates gene expression.
Cell 131(3) (2007) 557-571

15. Watson, J.D., Laskowski, R.A., Thornton, J.M.: Predicting protein function from sequence and structural data. Current opinion 
in structural biology 15(3) (2005) 275-284 

16. Whisstock, J.C., Lesk, A.M.: Prediction of protein function from protein sequence and structure. Q Rev Biophys 36(3) (2003)
307-340

17. Kosloff, M., Kolodny, R.: Sequence-similar, structure-dissimilar protein pairs in the pdb. Proteins 71(2) (2008) 891-902 

18. Laurents, D.V., Subbiah, S., Levitt, M.: Different protein sequences can give rise to highly similar folds through different
stabilizing interactions. Protein Sci 3(11) (1994) 1938-1944

19. The Gene Ontology Consortium: Gene ontology: tool for the unification of biology. Nature Genetics 25 (2000) 25-29. 

20. Przulj, N., Cornell, D.G., Jurisica, I.: Modeling interactome: Scale-free or geometric? Bioinformatics 20(18) (2004) 3508-3515 

21. Przulj, N., Cornell, D.G., Jurisica, I.: Efficient estimation of graphlet frequency distributions in protein-protein interaction 
networks. Bioinformatics 22(8) (2006) 974-980 






22. Przulj, N.: Biological network comparison using graphlet degree distribution. Bioinformatics 23 (2007) e177-e183

23. Milenkovic, T., Przulj, N.: Uncovering biological network function via graphlet degree signatures. Cancer Informatics 6 (2008)
257-273

24. Watts, D.J., Strogatz, S.H.: Collective dynamics of 'small-world' networks. Nature 393 (1998) 440-442

25. Altschul, S.F., Gish, W., Miller, W., Lipman, D.J.: Basic local alignment search tool. Journal of Molecular Biology 215 (1990)
403-410

26. Radivojac, P., Peng, K., Clark, W.T., Peters, B.J., Mohan, A., Boyle, S.M., Mooney, S.D.: An integrated approach to inferring
gene-disease associations in humans. Proteins 72(3) (2008) 1030-1037

27. Collins, S., Kemmeren, P., Zhao, X., Greenblatt, J., Spencer, F., Holstege, F., Weissman, J., Krogan, N.: Toward a comprehensive
atlas of the physical interactome of Saccharomyces cerevisiae. Molecular and Cellular Proteomics 6(3) (2008) 439-450

28. Stark, C., Breitkreutz, B., Reguly, T., Boucher, L., Breitkreutz, A., Tyers, M.: BioGRID: A general repository for interaction
datasets. Nucleic Acids Research 34 (2006) D535-D539

29. Peri, S., Navarro, J.D., Kristiansen, T.Z., Amanchy, R., Surendranath, V., Muthusamy, B., Gandhi, T.K., Chandrika, K.N.,
Deshpande, N., Suresh, S., Rashmi, B.P., Shanker, K., Padma, N., Niranjan, V., Harsha, H.C., Talreja, N., Vrushabendra,
B.M., Ramya, M.A., Yatish, A.J., Joy, M., Shivashankar, H.N., Kavitha, M.P., Menezes, M., Choudhury, D.R., Ghosh, N.,
Saravana, R., Chandran, S., Mohan, S., Jonnalagadda, C.K., Prasad, C.K., Kumar-Sinha, C., Deshpande, K.S., Pandey, A.:
Human protein reference database as a discovery resource for proteomics. Nucleic Acids Res 32 Database issue (2004)
D497-501

30. Rual, J., Venkatesan, K., Hao, T., Hirozane-Kishikawa, T., Dricot, A., Li, N., Berriz, G.F., Gibbons, F.D., Dreze, M., Ayivi-
Guedehoussou, N., Klitgord, N., Simon, C., Boxem, M., Milstein, S., Rosenberg, J., Goldberg, D.S., Zhang, L.V., Wong,
S.L., Franklin, G., Li, S., Albala, J.S., Lim, J., Fraughton, C., Llamosas, E., Cevik, S., Bex, C., Lamesch, P., Sikorski, R.S.,
Vandenhaute, J., Zoghbi, H.Y., Smolyar, A., Bosak, S., Sequerra, R., Doucette-Stamm, L., Cusick, M.E., Hill, D.E., Roth, F.P.,
Vidal, M.: Towards a proteome-scale map of the human protein-protein interaction network. Nature 437 (2005) 1173-78

31. Liao, C.S., Lu, K., Baym, M., Singh, R., Berger, B.: Isorankn: spectral methods for global alignment of multiple protein 
networks. Bioinformatics 25(12) (2009) 1253-258 

32. Milenkovic, T., Lai, J., Przulj, N.: Graphcrunch: a tool for large network analyses. BMC Bioinformatics 9(70) (2008) 

33. Barabasi, A., Albert, R.: Emergence of scaling in random networks. Science 286(5439) (1999) 509-512 

34. Przulj, N., Higham, D.: Modelling protein-protein interaction networks via a stickiness index. Journal of the Royal Society 
Interface 3(10) (2006) 711-716 

35. Labarga, A., Valentin, F., Andersson, M., Lopez, R.: Web services at the European Bioinformatics Institute. Nucleic Acids
Research 35(Web Server issue) (2007) W6-W11

36. Forst, C., Schulten, K.: Phylogenetic analysis of metabolic pathways. J Mol Evol 52 (2001) 471-489

37. Heymans, M., Singh, A.: Deriving phylogenetic trees from the similarity analysis of metabolic pathways. Bioinformatics 19
Suppl. 1 (2003) i138-i146

38. Zhang, Y., Li, S., Skogerbø, G., Zhang, Z., Zhu, X., Zhang, Z., Sun, S., Lu, H., Shi, B., Chen, R.: Phylophenetic properties of
metabolic pathway topologies as revealed by global analysis. BMC Bioinformatics 7:252 (2006)

39. Suthram, S., Sittler, T., Ideker, T.: The Plasmodium protein network diverges from those of other eukaryotes. Nature 438 
(2005) 108-112 

40. Webb, E.C.: Enzyme nomenclature 1992: recommendations of the Nomenclature Committee of the International Union of 
Biochemistry and Molecular Biology on the nomenclature and classification of enzymes, the University of Michigan (1992) 

41. Agrafioti, I., Swire, J., Abbott, J., Huntley, D., Butcher, S., Stumpf, M.P.: Comparative analysis of the Saccharomyces cerevisiae
and Caenorhabditis elegans protein interaction networks. BMC Evolutionary Biology 5(23) (2005)

42. Kanehisa, M., Goto, S.: KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28 (2000) 27-30

43. Pennisi, E.: Modernizing the tree of life. Science 300(5626) (2003) 1692 - 1697 

44. Keeling, P., Luker, M., Palmer, J.: Evidence from beta-tubulin phylogeny that microsporidia evolved from within the fungi. 
Mol. Biol. Evol. 17(1) (2000) 23-31 

45. Tanriverdi, S., Widmer, G.: Differential evolution of repetitive sequences in Cryptosporidium parvum and Cryptosporidium
hominis. Infect. Genet. Evol. 6(2) (2006) 113-22

46. Xu, P., Widmer, G., Wang, Y., Ozaki, L., Alves, J., Serrano, M., Puiu, D., Manque, P., Akiyoshi, D., Mackey, A., Pearson,
W., Dear, P., Bankier, A., Peterson, D., Abrahamsen, M., Kapur, V., Tzipori, S., Buck, G.: The genome of Cryptosporidium
hominis. Nature 431(7012) (2004) 1107-12

47. Otu, H., Sayood, K.: A new sequence distance measure for phylogenetic tree construction. Bioinformatics 19(16) (2003)
2122-2130

48. Thorne, T., Stumpf, M.: Generating confidence intervals on biological networks. BMC Bioinformatics 8(1) (2007) 467 

49. Snijders, T.A.: Markov chain Monte Carlo estimation of exponential random graph models. Journal of Social Structure 3(2)
(2002) 2-40






50. Kuchaiev, O., Przulj, N.: Learning the structure of protein-protein interaction networks. Pacific Symposium on Biocomputing 
(2009) 39-50 

51. Higham, D., Rasajski, M., Przulj, N.: Fitting a geometric graph to a protein-protein interaction network. Bioinformatics 24(8)
(2008) 1093-1099

52. Jeong, H., Mason, S.P., Barabasi, A.L., Oltvai, Z.N.: Lethality and centrality in protein networks. Nature 411(6833) (2001) 
41-2 

53. Stumpf, M.P.H., Wiuf, C., May, R.M.: Subnets of scale-free networks are not scale-free: Sampling properties of networks.
Proceedings of the National Academy of Sciences 102 (2005) 4221-4224

54. Han, J.D.H., Dupuy, D., Bertin, N., Cusick, M.E., Vidal, M.: Effect of sampling on topology predictions of protein-protein
interaction networks. Nature Biotechnology 23 (2005) 839-844

55. de Silva, E., Thorne, T., Ingram, P., Agrafioti, I., Swire, J., Wiuf, C., Stumpf, M.P.: The effects of incomplete protein interaction
data on structural and evolutionary inferences. BMC Biol. 4(39) (2006)

56. Przulj, N., Kuchaiev, O., Stevanović, A., Hayes, W.: Geometric evolutionary dynamics of protein interaction networks. Pacific
Symposium on Biocomputing (2010) to appear

57. Kuchaiev, O., Wang, P.T., Nenadic, Z., Przulj, N.: Structure of brain functional networks. 31st Annual International
Conference of the IEEE Engineering in Medicine and Biology Society (2009)

58. Milenkovic, T., Filippis, I., Lappe, M., Przulj, N.: Optimized null model for protein structure networks. PLoS ONE 4(6) (2009) 
e5967. 







Fig. 1. All the connected graphs on up to 5 nodes. When appearing as an induced subgraph of a larger graph, we call them graphlets. 
They contain 73 topologically unique node types, called "automorphism orbits." In a particular graphlet, nodes belonging to the 
same orbit are of the same shade. Graphlet $G_0$ is just an edge, and the degree of a node historically defines how many edges it
touches. We generalize the degree to a 73-component "graphlet degree" vector that counts how many times a node is touched by
each particular automorphism orbit.
Fig. 2. An illustration of how the degree of node $v$ in the leftmost panel is generalized into its "graphlet degree vector," or
"signature," that counts the number of different graphlets that the node touches, such as triangles (middle panel) or squares (rightmost
panel). Values of the 73 coordinates of the graphlet degree vector of node $v$, $GDV(v)$, are presented in the table.
Fig. 3. The alignment of yeast2 and human1 PPI networks. An edge between two nodes means that an interaction exists in both
species between the corresponding protein pairs. Thus, the displayed networks appear, in their entirety, in the PPI networks of both
species. (A) The largest common connected subgraph (CCS) consisting of 900 interactions amongst 267 proteins. (B) The second
largest CCS consisting of 286 interactions amongst 52 proteins; each node contains a label denoting a pair of yeast and human
proteins that are aligned.
Fig. 4. Comparison of the phylogenetic trees for protists obtained by genetic sequence alignments and by GRAAL's metabolic
network alignments. Left: The tree obtained from genetic sequence comparison. Right: The tree obtained from GRAAL. The
following abbreviations are used for species: CHO - Cryptosporidium hominis, DDI - Dictyostelium discoideum, CPV -
Cryptosporidium parvum, PFA - Plasmodium falciparum, EHI - Entamoeba histolytica, TAN - Theileria annulata, TPV - Theileria
parva. The species are grouped into the following classes: "Alveolates," "Entamoeba," and "Cellular Slime mold."